Logistic Regression

University of San Francisco

Matt Meister

This Module

Thus far, we’ve talked about “linear” regression.

  • What made it linear?
    • Drawing a line through points

This Module

Today, we will talk about “logistic” regression.

  • Dependent variable can either be 1 or 0
    • Coefficients are different

Why Would We Draw That??

After This Module

You’ll be able to perform, interpret, and communicate:

  • Logistic regression
    • Regression where the DV can be either 1 or 0
    • Interpreting coefficients is a little less intuitive than with linear regression
  • Interactions in logistic regression
    • Whether and when the effect of one variable depends on the level of another

Run this code

rm(list = ls())
# List of packages
pkgs <- c('dplyr', 'ggplot2')

# Check for packages that are not installed
new.pkgs <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]

# Install the ones we need
if(length(new.pkgs)) install.packages(new.pkgs)

# Load them all in
lapply(pkgs, library, character.only = TRUE)

# Remove lists
rm(pkgs, new.pkgs)

age.data <- data.frame(
  age = c(27,30,32,33,35,40,44,45,50,58,59,60),
  buyer = c(0,0,1,0,0,1,0,1,1,1,1,1))

Logistic Regression

  • A type of regression analysis used to predict the probability of a binary outcome
    • Linear regression predicts the specific value of a continuous outcome
  • Outcome can be one of two categories:
    • yes/no
    • true/false
    • 0/1
    • Over 65/under 65
  • Logistic regression models the relationship between one or more predictor variables (features) and the probability of the binary outcome occurring.
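  • In equation form, for a single predictor \(x\) (this is the standard logistic model, and it matches the log-odds formula later in these slides):
    • \(p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\)
    • Equivalently: \(\log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 x\)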

Logistic Regression: When?

  • When we’re dealing with categorical dependent variables
    • Where the outcome falls into discrete categories
  • Examples?
    • Medical Diagnosis
      • Determining whether a patient has a specific disease based on various medical tests and patient characteristics.
    • Purchase likelihood
      • Predicting whether a customer will make a purchase based on their browsing history, demographics, and other factors
    • Credit Scoring
      • Assessing the likelihood of a loan default based on credit scores, income, and other financial indicators.

Differences Between Logistic and Linear Regression

  • Linear regression can predict values below 0 and above 1
    • Logistic regression can only predict between 0 and 1
  • Linear regression will usually make less certain predictions
    • Less likely to be almost 0 or almost 1
  • Coefficients are interpreted differently
    • Unit changes (linear) vs log-odds (logistic)
  • Models are estimated differently
    • Least squares (linear) vs maximum likelihood (logistic)

Difference 1: Predictions are Between 0 and 1

  • What is the prediction for someone 60 years old?

lm(data = age.data,
   formula = buyer ~ age) %>%
  predict(newdata = 
            data.frame(age = 60))
      1 
1.08386 

glm(data = age.data,
   formula = buyer ~ age,
   family = 'binomial') %>%
  predict(newdata = 
            data.frame(age = 60),
          type = "response")
        1 
0.9855778 
  • In this case, you’ll hear the linear regression called a “linear probability model”
    • It says there is a 108.4% chance that someone 60 years old is a buyer
  • Logistic regression can’t go above 1 or below 0

Difference 1: Predictions are Between 0 and 1

  • What is the prediction for someone 20 years old?

lm(data = age.data,
   formula = buyer ~ age) %>%
  predict(newdata = 
            data.frame(age = 20))
          1 
-0.07678176 

glm(data = age.data,
   formula = buyer ~ age,
   family = 'binomial') %>%
  predict(newdata = 
            data.frame(age = 20),
          type = "response")
         1 
0.02532202 

Difference 2: Linear Regression Makes Less Certain Predictions (Between 0 and 1)

  • Between 35 and 50, which line is farther from 50%?

  • Because logistic regression can bend, it is closer to 0 or 1
    • When it is between 0 and 1
  • Therefore, logistic regression makes more certain predictions (between 0 and 1)
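  • A quick way to draw that comparison (a sketch using ggplot2 and age.data from the setup code; the colors are arbitrary choices):
ggplot(age.data,
       aes(x = age, y = buyer)) +
  geom_point() +
  # Straight line: linear probability model
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  # S-shaped curve: logistic regression
  geom_smooth(method = "glm", se = FALSE, color = "red",
              method.args = list(family = "binomial")) +
  # The 50% line both fits are compared against
  geom_hline(yintercept = 0.5, linetype = "dashed")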

Difference 3: Coefficients Are Interpreted Differently

lm(data = age.data,
   formula = buyer ~ age) %>%
  summary() %>% #Summarize the model
  coef() %>% # Take just coefficients
  round(4) # Round to 4 decimal places
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.6571     0.4530 -1.4504   0.1776
age           0.0290     0.0102  2.8327   0.0178
  • Linear coefficients are simpler
    • \(p = -.6571 + .029 \times age\)
    • Where:
      • p: Probability of being a buyer
  • One year increase in age?
    • 2.9 percentage point increase in the probability someone is a buyer
glm(data = age.data, 
   formula = buyer ~ age,
   family = 'binomial') %>%
  summary() %>% #Summarize the model
  coef() %>% # Take just coefficients
  round(4) # Round to 4 decimal places
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -7.5879     4.2767 -1.7742   0.0760
age           0.1969     0.1099  1.7913   0.0732
  • Logistic regression coefficients are more complex
  • One year increase in age?
    • 0.1969 increase in the log-odds of being a buyer
    • \(\log \left( \frac{p}{1-p} \right) = -7.5879 + .1969 \times age\)
      • Where:
        • p: Probability of being a buyer
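  • One way to make that more tangible (a sketch; this step is not in the output above): exponentiate the coefficients to turn log-odds into odds ratios
glm(data = age.data,
   formula = buyer ~ age,
   family = 'binomial') %>%
  coef() %>% # Take just the coefficients
  exp() %>%  # Convert log-odds into odds ratios
  round(4)   # Round to 4 decimal places
  • exp(0.1969) ≈ 1.22: each additional year multiplies the odds of being a buyer by about 1.22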

How log-odds relate to probabilities

  • This is not examinable
  • Log-odds are a way to express probabilities on a logarithmic scale
    • Modeling on this scale is what keeps the model’s predicted probabilities between 0 and 1
  • Probability (p):
    • The likelihood of an event occurring, ranges from 0 (impossible) to 1 (certain).
  • Odds (O):
    • The odds of an event happening is the ratio of the probability of the event occurring to the probability of the event not occurring.
    • \(O = \frac{p}{1-p}\)
    • Odds can range from 0 to positive infinity.
  • Log-Odds (Logit): The log-odds (also known as the logit) is the natural logarithm of the odds. It’s mathematically represented as
    • \(\text{log}(\frac{p}{1-p})\)
    • Log-odds can range from negative infinity to positive infinity.
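  • R has both conversions built in (a small sketch; qlogis() maps a probability to log-odds, plogis() maps log-odds back to a probability):
# Probability -> log-odds (logit)
qlogis(0.8) # log(0.8 / 0.2), about 1.39

# Log-odds -> probability (inverse logit)
plogis(1.39) # about 0.8

# The age-60 prediction from earlier, done by hand:
plogis(-7.5879 + 0.1969 * 60) # about 0.986, matching predict() above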

How log-odds relate to probabilities

  • Back to Probabilities
    • The log-odds transform the probability scale (0 to 1) to a scale that spans the entire real number line (negative infinity to positive infinity).
    • Advantages:
      • Linearity
        • In logistic regression, the relationship between predictor variables and log-odds is assumed to be linear.
        • This linear relationship makes it easier to model and analyze relationships between predictors and the outcome.
      • Interpretability
        • A one-unit change in a predictor variable leads to a constant change in log-odds, regardless of the initial probability level.
      • Symmetry
        • The log-odds are symmetric around 0.
        • This symmetry simplifies the mathematics in logistic regression.
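  • A quick check of the interpretability point (a sketch reusing the coefficients estimated above):
b0 <- -7.5879 # Intercept from the fitted model
b1 <- 0.1969  # Age coefficient from the fitted model

# One extra year always adds b1 = 0.1969 to the log-odds,
# but the change in probability depends on where you start:
plogis(b0 + b1 * 41) - plogis(b0 + b1 * 40) # mid-range: about 0.047
plogis(b0 + b1 * 61) - plogis(b0 + b1 * 60) # near 1: about 0.003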

Drawbacks of Log-Odds?

WTF WAS THAT???

Difference 3: Coefficients Are Interpreted Differently

What you need to know about coefficients:

  • Logistic regression coefficients are not the same as linear ones
    • They relate to some complicated math
    • That math makes the regression itself possible
    • But it makes the results hard to interpret
  • My recommendation is to:
    • Plot your data
    • Look at the z-value in your regression results
      • Like a t-value
    • Use the predict function to understand specific predictions
glm(data = age.data,
   formula = buyer ~ age,
   family = 'binomial') %>%
  predict(newdata = 
            data.frame(age = 20),
          type = "response")
         1 
0.02532202 

Differences Between Logistic and Linear Regression

  • Linear regression can predict values below 0 and above 1
    • Logistic regression can only predict between 0 and 1
  • Linear regression will usually make less certain predictions
    • Less likely to be almost 0 or almost 1
  • Coefficients are interpreted differently
    • Unit changes (linear) vs log-odds (logistic)
  • Models are estimated differently
    • Least squares (linear) vs maximum likelihood (logistic)

My Take on Logistic Regression

  • Logistic regression is great for prediction
    • Precise between 0 and 1
    • Can’t give you an impossible number
    • MLE is very flexible
      • Not prone to errors we’ll talk about later
  • But interpretation/communication?
    • Not so much
  • I think linear probability models are often good enough
  • I am not expecting you to know it in great depth this semester
    • Let’s just get some practice, some comfort
    • Next semester we’ll focus on it more

Linear Probability Models are Often Good Enough

How do we interpret each result?

lm(data = age.data,
   formula = buyer ~ age) %>%
  summary() %>%
  coef() %>%
  round(4)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.6571     0.4530 -1.4504   0.1776
age           0.0290     0.0102  2.8327   0.0178

glm(data = age.data,
   formula = buyer ~ age,
   family = 'binomial') %>%
  summary() %>%
  coef() %>%
  round(4)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -7.5879     4.2767 -1.7742   0.0760
age           0.1969     0.1099  1.7913   0.0732

Linear Probability Models are Often Good Enough

  • For interpretation:
    • How do we interpret each result?
  • It’s the same!
    • Older people are more likely to be buyers
    • The p-values differ a bit, but the conclusion does not

Linear Probability Models are Often Good Enough

  • For prediction:
    • How do our predictions differ?
    • Let’s say we have a new group of 10 customers
age.data.2 <- data.frame(
  age = c(25, 28, 30, 35, 40,
          48, 52, 60, 65, 65))
  • Linear regression
linear.reg <- lm(data = age.data, formula = buyer ~ age)
  • Predict buyer from new ages
age.data.2$buyer.linear <- predict(
  linear.reg,
  newdata = age.data.2,
  type = 'response'
)
  • But buyer has to be 0 or 1! So set it as such:

Linear Probability Models are Often Good Enough

  • But buyer has to be 0 or 1! So set it as such:
age.data.2$buyer.linear <- ifelse(
  age.data.2$buyer.linear >= .5, 1, 0
)
  • Let’s look at the result:
   age buyer.linear
1   25            0
2   28            0
3   30            0
4   35            0
5   40            1
6   48            1
7   52            1
8   60            1
9   65            1
10  65            1

Linear Probability Models are Often Good Enough

  • Logistic regression
logistic.reg <- glm(data = age.data, formula = buyer ~ age, family = 'binomial')
  • Predict buyer from new ages
age.data.2$buyer.logistic <- predict(
  logistic.reg,
  newdata = age.data.2,
  type = 'response'
)
  • But buyer has to be 0 or 1! So set it as such:

Linear Probability Models are Often Good Enough

  • But buyer has to be 0 or 1! So set it as such:
age.data.2$buyer.logistic <- ifelse(
  age.data.2$buyer.logistic >= .5, 1, 0
)
  • Let’s look at the result:
   age buyer.linear buyer.logistic
1   25            0              0
2   28            0              0
3   30            0              0
4   35            0              0
5   40            1              1
6   48            1              1
7   52            1              1
8   60            1              1
9   65            1              1
10  65            1              1

Linear Probability Models are Often Good Enough

  • What did you notice?
  • There was no difference!
    • What about if we had more data?

Linear Probability Models are Often Good Enough

  • Let’s simulate 1000 customers
set.seed(101)
age.data.3 <- data.frame(
  age = runif(n = 1000,
              min = 20,
              max = 65))
  • Run the following:
    • Linear regression
    • Logistic regression
    • Predict from linear
    • Predict from logistic
    • Compare the differences

Linear Probability Models are Often Good Enough

  • Let’s simulate 1000 customers
set.seed(101)
age.data.3 <- data.frame(
  age = runif(n = 1000,
              min = 20,
              max = 65))

linear.reg <- lm(data = age.data, formula = buyer ~ age)
logistic.reg <- glm(data = age.data, formula = buyer ~ age, family = 'binomial')

age.data.3$buyer.linear <- predict(
  linear.reg,
  newdata = age.data.3,
  type = 'response'
)

age.data.3$buyer.linear <- ifelse(
  age.data.3$buyer.linear >= .5, 1, 0
)

age.data.3$buyer.logistic <- predict(
  logistic.reg,
  newdata = age.data.3,
  type = 'response'
)

age.data.3$buyer.logistic <- ifelse(
  age.data.3$buyer.logistic >= .5, 1, 0
)

Linear Probability Models are Often Good Enough

  • Let’s simulate 1000 customers
  • How often are our results different?
age.data.3$discrepancy <- ifelse(
  age.data.3$buyer.linear == age.data.3$buyer.logistic, 0, 1
)

round(mean(age.data.3$discrepancy), 3)
[1] 0.031
  • What’s another way I might want to look at this?
ggplot(age.data.3,
       aes(x = age,
           y = discrepancy)) +
  geom_point()
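  • One more quick look (a sketch): find the ages where the two models actually disagree
# Ages where the linear and logistic labels differ
range(age.data.3$age[age.data.3$discrepancy == 1])
  • The disagreements sit in the middle of the age range, where predictions hover near 50%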

Linear Probability Models are Often Good Enough

  • With whoever is beside you, come up with some ideas to answer:
    • When are linear probability models probably not “good enough”?
    • When will linear and logistic regressions give different results?

When will linear and logistic regressions give different results?

  • Far to the left and right of the x-axis?

When will linear and logistic regressions give different results?

  • Far to the left and right of the x-axis?
  • Only if we don’t convert our linear predictions to 0 or 1!
    • Once we apply the 0.5 threshold, there is no difference

When will linear and logistic regressions give different results?

In the middle?

When will linear and logistic regressions give different results?

In the middle?

  • Yes!
  • Specifically when there is a strong flip

When will linear and logistic regressions give different results?

When the flip is very strong:

age.data <- data.frame(
  age = c(27,30,32,33,35,40,44,45,50,58,59,60),
  buyer = c(0,0,0,0,0,0,1,0,1,1,1,1))
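  • To see the divergence, refit both models on this flipped data and compare predictions around the flip (a sketch; the comparison ages are arbitrary):
linear.flip <- lm(buyer ~ age, data = age.data)
logistic.flip <- glm(buyer ~ age, data = age.data,
                     family = 'binomial')

# Compare predicted probabilities around the flip point,
# where the steep logistic curve and the flat linear line pull apart
new.ages <- data.frame(age = c(40, 43, 46, 49))
data.frame(
  age = new.ages$age,
  linear = round(predict(linear.flip, newdata = new.ages), 3),
  logistic = round(predict(logistic.flip, newdata = new.ages,
                           type = 'response'), 3)
)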

When will linear and logistic regressions give different results?

When the flip is very strong:

  • Does this seem likely?
  • It’s not
  • You should be very curious if something flips like this
    • People are not lightbulbs

Logistic Regression: Takeaways

  • Regression where the DV can be either 1 or 0
  • Interpreting coefficients is a little less intuitive than with linear regression
    • Log-odds instead of unit changes
      • Log-odds are mathematically convenient
      • Not so interpretable
    • Focus on z-scores, and predict()
  • Logistic regression is more computationally intensive
    • Maximum likelihood vs least squares

Logistic Regression: Takeaways

  • Logistic regression is more precise
    • Does not allow for > 1 or < 0
    • “S” shape allows more intermediate data points to predict 0 or 1
  • There is often not a meaningful difference between logistic and linear regression
    • So it’s good for prediction (where rounding errors can compound into bigger mistakes)
    • But might not be needed for communication
    • Run both, see if your results match!